[None][fix] write per-rank torch profile traces #13536
GavinZhu-GMI wants to merge 1 commit into NVIDIA:main from
Conversation
No actionable comments were generated in the recent review. 🎉
Walkthrough
The change adds rank-specific filename handling to torch profiler trace exports. When tracing is enabled via environment variable, the export filename is rewritten to include the global rank identifier, preventing concurrent writes from multiple ranks to a single file.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Pre-merge checks: 4 passed, 1 warning.
/bot run

PR_Github #45826 [ run ] triggered by Bot. Commit:

PR_Github #45826 [ run ] completed with state

/bot run

PR_Github #45952 [ run ] triggered by Bot. Commit:

PR_Github #45952 [ run ] completed with state

@tensorrt-cicd Cannot see the exact failure of blossom-ci, can you share the details of the pipeline?
Force-pushed 511c041 to c9adab8.
CI flakiness, sorry about that. I'll retry.

/bot run

2 similar comments

/bot run

/bot run

PR_Github #46145 [ run ] triggered by Bot. Commit:

PR_Github #46145 [ run ] completed with state
Force-pushed c9adab8 to f4529ce.
/bot run

PR_Github #46170 [ run ] triggered by Bot. Commit:

PR_Github #46170 [ run ] completed with state

/bot run

PR_Github #46227 [ run ] triggered by Bot. Commit:

PR_Github #46227 [ run ] completed with state
PyExecutor reads TLLM_TORCH_PROFILE_TRACE directly and every rank calls torch_profiler.export_chrome_trace() on the same path. When TP/PP/DP > 1, the concurrent writes interleave and the resulting file fails to parse in Chrome tracing / Perfetto (bad control character / unterminated string at the byte where one rank's output overran another's).

Append the rank to the env-provided path before the first use so each rank writes to its own file. Matches SGLang's scheduler_profiler_mixin filename convention: the user supplies a base path, the runtime adds the per-rank suffix automatically. Example: TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json now produces /tmp/trace-rank-0.json, /tmp/trace-rank-1.json, etc.

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
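A minimal sketch of the filename rewrite the commit describes; the helper name `_with_rank_suffix` is illustrative, not necessarily what the patch uses:

```python
import os

def _with_rank_suffix(path: str, rank: int) -> str:
    """Insert a -rank-N marker before the file extension."""
    base, ext = os.path.splitext(path)
    return f"{base}-rank-{rank}{ext}"

# TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json on rank 3 -> /tmp/trace-rank-3.json
print(_with_rank_suffix("/tmp/trace.json", 3))
```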
Force-pushed f4529ce to 9c360e0.
/bot run --disable-fail-fast

PR_Github #46427 [ run ] triggered by Bot. Commit:

PR_Github #46427 [ run ] completed with state
Summary
PyExecutor reads `TLLM_TORCH_PROFILE_TRACE` directly and every rank calls `torch_profiler.export_chrome_trace()` on the same path. At TP/PP/DP > 1 the concurrent writes interleave and the resulting file fails to parse in Chrome tracing / Perfetto (bad control character / unterminated string at the byte where one rank's output overran another's). Fix: append the rank to the env-provided path before the first use so each rank writes to its own file.
`TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json` now produces `/tmp/trace-rank-0.json`, `/tmp/trace-rank-1.json`, etc. — same convention SGLang's `scheduler_profiler_mixin` already uses (the user supplies a base path, the runtime adds the per-rank suffix automatically).
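A hedged sketch of where the rewrite sits relative to the export call. `torch.profiler.profile` and `export_chrome_trace` are the real PyTorch APIs; the env-var wiring and `RANK` lookup are illustrative of the flow, not copied from the patch:

```python
import os
import torch
from torch.profiler import ProfilerActivity, profile

trace_path = os.environ.get("TLLM_TORCH_PROFILE_TRACE")  # user-supplied base path
rank = int(os.environ.get("RANK", "0"))  # assumes the launcher exports a global rank

if trace_path:
    # Rewrite before first use so every rank targets its own file.
    base, ext = os.path.splitext(trace_path)
    trace_path = f"{base}-rank-{rank}{ext}"

# CPU-only here so the sketch runs anywhere; a real run would add
# ProfilerActivity.CUDA to capture GPU kernels.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(512, 512) @ torch.randn(512, 512)  # stand-in workload

if trace_path:
    prof.export_chrome_trace(trace_path)  # each rank now writes a distinct file
```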
Why this matters
A previous attempt at this fix (#9022, from `clintg6:fix/multi-gpu-torch-profiling`) received only a coderabbitai bot review and was closed by the author 6 days later without human engagement. This PR re-opens it with a smaller diff (8 lines vs. a ~15+ line refactor) and fresh validation evidence.
Validation
Reproduced on TRT-LLM `1.3.0rc11` with TP=8 on 8×H200 serving `zai-org/GLM-5.1-FP8`.

Before patch (single shared path):
The corrupt byte range contains `"name": "void at::native::vectorized_elem`, truncated mid-string by another rank's `{` for the next event.
"name": "void at::native::vectorized_elemtruncated mid-string by another rank's{for the next event.After patch (per-rank paths, same env value, same workload):
Distinct sizes confirm no shared clobbering; rank-0 parses with 63,079 events.
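A sketch of the parse check behind that event count, assuming the exported file is standard Chrome trace JSON with a top-level `traceEvents` array (the path is the rank-0 example from above):

```python
import json

# json.load raises on the interleaved bytes a shared path produces;
# it succeeds on a per-rank file.
with open("/tmp/trace-rank-0.json") as f:
    trace = json.load(f)

print(len(trace["traceEvents"]), "events")  # 63,079 for rank 0 in the run above
```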
Backwards compatibility
`TLLM_TORCH_PROFILE_TRACE` env name unchanged. Output is now named `<base>-rank-0<ext>` instead of `<base>`, even on a single rank. This is the same compromise SGLang made and the only sane disambiguation once you scale to multiple ranks.
Test plan
- Each per-rank trace parses cleanly with `json.load`.
- Change is confined to the trace-path rewrite at `py_executor.py:886`.

Reviewers
cc @NVIDIA/trt-llm-torch-runtime-devs @byshiue @xxi-nv — re-opening the multi-rank torch-profiler trace fix from #9022 (which went stale without human review) with a smaller diff and a concrete reproducer. Would appreciate eyes here so distributed profiling stops silently corrupting traces.